Connect RStudio to the Python 3 Anaconda installation via the reticulate package.

conda_list()
##        name                                                 python
## 1 anaconda3             /Users/simonheuberger/anaconda3/bin/python
## 2      TEST   /Users/simonheuberger/anaconda3/envs/TEST/bin/python
## 3    pa_rep /Users/simonheuberger/anaconda3/envs/pa_rep/bin/python
use_condaenv("anaconda3")

Import the needed Python libraries.

import numpy as np # numerical computing library
import pandas as pd # data wrangling library

# The rest is from sklearn, an excellent machine learning library
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

Load and look at the data in Python.

dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep = ";")
print(data.head())
##    fixed acidity  volatile acidity  citric acid  ...  sulphates  alcohol  quality
## 0            7.4              0.70         0.00  ...       0.56      9.4        5
## 1            7.8              0.88         0.00  ...       0.68      9.8        5
## 2            7.8              0.76         0.04  ...       0.65      9.8        5
## 3           11.2              0.28         0.56  ...       0.58      9.8        6
## 4            7.4              0.70         0.00  ...       0.56      9.4        5
## 
## [5 rows x 12 columns]

This preview doesn’t tell us much about the data. R does better here.

py$data %>%
    as_tibble() %>%
    glimpse()
## Observations: 1,599
## Variables: 12
## $ `fixed acidity`        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, …
## $ `volatile acidity`     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,…
## $ `citric acid`          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,…
## $ `residual sugar`       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2…
## $ chlorides              <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,…
## $ `free sulfur dioxide`  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15…
## $ `total sulfur dioxide` <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, …
## $ density                <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0…
## $ pH                     <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,…
## $ sulphates              <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,…
## $ alcohol                <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, …
## $ quality                <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,…

What can we see? All features are numeric (dbl). The target of our predictions is quality; the predictors are fixed acidity, chlorides, pH, and so on. Let’s carry out the machine learning estimation in Python, to take advantage of the sklearn library.

# Separate the data into predictor (X) and target (y) variables
y = data.quality
X = data.drop("quality", axis=1)

# Split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
  X, y,
  test_size    = 0.2,
  random_state = 123,
  stratify     = y
)
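The stratify argument is worth pausing on: it makes the split preserve the distribution of quality levels in both sets, so rare very-good and very-bad wines don't end up concentrated in one split. A minimal sketch with synthetic stand-in data (the values and the single "alcohol" column are hypothetical, not the wine dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (hypothetical values)
rng = np.random.RandomState(123)
X = pd.DataFrame({"alcohol": rng.uniform(9, 13, 200)})
y = pd.Series(rng.choice([5, 6, 7], size=200, p=[0.5, 0.3, 0.2]), name="quality")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

# With stratify=y, the class proportions are nearly identical in both splits
print(y_train.value_counts(normalize=True).sort_index().round(2))
print(y_test.value_counts(normalize=True).sort_index().round(2))
```

Without stratification, a small test set can easily end up with a skewed mix of quality levels by chance.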

# Fit a scaler on X_train, so test data is scaled with training-set statistics
scaler = preprocessing.StandardScaler().fit(X_train)

# Apply the transformation to X_test (illustrative only -- the pipeline below
# handles scaling automatically)
X_test_scaled = scaler.transform(X_test)
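The key point is that the scaler learns its mean and standard deviation from the training data only, and then applies those same statistics to the test data. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn import preprocessing

# Tiny made-up training and test matrices (two feature columns)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test  = np.array([[2.0, 25.0]])

scaler = preprocessing.StandardScaler().fit(X_train)

# The training data transforms to mean 0 and std 1 per column
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))

# Test data is scaled with the *training* statistics,
# so its columns need not come out centered at 0
print(scaler.transform(X_test))
```

Scaling the test set with its own statistics would leak information from the test data into preprocessing, which is exactly what fitting on X_train avoids.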

# Setup an ML pipeline where numeric values are scaled and a random forest
# regression model is created
pipeline = make_pipeline(
    preprocessing.StandardScaler(),
    RandomForestRegressor(n_estimators = 100)
)
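One detail that matters for the next step: make_pipeline names each step automatically after its lowercased class name. Those generated names are what the double-underscore prefixes in the hyperparameter grid refer to. A quick check:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing

pipeline = make_pipeline(
    preprocessing.StandardScaler(),
    RandomForestRegressor(n_estimators=100)
)

# make_pipeline auto-names steps after their lowercased class names
print([name for name, _ in pipeline.steps])
```

This is why the grid keys below start with "randomforestregressor__": that prefix routes each parameter to the correct pipeline step.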

# We want to find the optimal combination of parameters. Set up a hyperparameter
# grid with the combinations of settings we want to try
hyperparameters = {
    "randomforestregressor__max_features" : ["auto", "sqrt", "log2"],
    "randomforestregressor__max_depth"    : [None, 5, 3, 1]
}
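It helps to know how much work this grid implies. GridSearchCV fits one model per parameter combination and cross-validation fold, which can be counted with sklearn's ParameterGrid helper. (One caveat: recent scikit-learn versions, 1.3 and later, no longer accept max_features="auto" for forests; "sqrt" or a float is the modern equivalent.)

```python
from sklearn.model_selection import ParameterGrid

hyperparameters = {
    "randomforestregressor__max_features": ["auto", "sqrt", "log2"],
    "randomforestregressor__max_depth": [None, 5, 3, 1],
}

# The grid enumerates every combination: 3 * 4 = 12 candidates.
# With cv=10, each candidate is fit 10 times -> 120 cross-validation fits
# (plus one final refit of the best candidate on the full training set).
grid = list(ParameterGrid(hyperparameters))
print(len(grid))
print(len(grid) * 10)
```

Keeping this count in mind is useful before adding more values to the grid, since the number of fits grows multiplicatively.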

# Then apply grid search with cross validation
clf = GridSearchCV(estimator = pipeline, param_grid = hyperparameters, cv = 10)
clf.fit(X_train, y_train)
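After fitting, the winning parameter combination and its cross-validated score are available on the fitted search object via best_params_ and best_score_. A self-contained sketch on small synthetic data (the data and the reduced grid are hypothetical, chosen only to keep the example fast):

```python
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Small synthetic regression problem (hypothetical data)
rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=60)

pipeline = make_pipeline(
    preprocessing.StandardScaler(),
    RandomForestRegressor(n_estimators=10, random_state=0)
)
grid = {"randomforestregressor__max_depth": [None, 3]}

clf = GridSearchCV(estimator=pipeline, param_grid=grid, cv=3)
clf.fit(X, y)

# Best combination found, and its mean cross-validated R^2 score
print(clf.best_params_)
print(round(clf.best_score_, 3))
```

Inspecting best_params_ on the wine model tells us which max_features and max_depth settings the search actually selected.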

# Make predictions
y_pred = clf.predict(X_test)
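Before visualizing, the predictions can also be scored numerically with sklearn's regression metrics. A sketch with hypothetical actual/predicted values in place of the real y_test and y_pred:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted quality levels, for illustration only
y_test = np.array([5, 6, 5, 7, 6])
y_pred = np.array([5.2, 5.8, 5.1, 6.4, 6.1])

# R^2: share of variance in the target explained by the model (1.0 is perfect)
print(r2_score(y_test, y_pred))

# Mean squared error: average squared gap between prediction and truth
print(mean_squared_error(y_test, y_pred))
```

These two numbers summarize overall fit; the plot below is better at showing *where* the model goes wrong.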

Is our model any good? Let’s visualize things in R to find out.

# Manipulate data for ggplot
results_tbl <- tibble(
    y_test = py$y_test,
    y_pred = py$y_pred
    ) %>%
    rowid_to_column() %>%
    arrange(y_test) %>%
    mutate(rowid = as_factor(as.character(rowid))) %>%
    rowid_to_column("sorted_rowid") %>%
    gather(key = "key", value = "value", -c(rowid, sorted_rowid))

# Make ggplot
p <- results_tbl %>%
  ggplot(aes(sorted_rowid, value, color = key)) +
  geom_point(alpha = 0.5) +
  geom_smooth() +
  theme_tq() +
  scale_color_tq() +
  labs(
    title = "Prediction Versus Actual",
    subtitle = "Wine Quality Level",
    x = "Sorted RowID", y = "Quality Level"
    )

p

The above is a static version. What if we want to analyze the plot interactively in our document?

ggplotly(p)

Note that the interactive features of plotly only work in HTML output; in a PDF the plot renders as a static image. So what about our model? The red and black lines track each other fairly well, but the gap clearly widens at both ends: the model predicts low wine quality levels higher than they should be and high wine quality levels lower than they should be. Thus, our model still needs improvement to accurately predict very bad and very good wines.